Abstract

The k-d tree was one of the first spatial data structures proposed for nearest neighbor search. Its 
efficacy is diminished in high-dimensional spaces, but several variants, with randomization and overlapping
cells, have proved to be successful in practice. We analyze three such schemes. We show that the
probability that they fail to find the nearest neighbor, for any data set and any query point, is directly 
related to a simple potential function that captures the difficulty of the point configuration. We then 
bound this potential function in two situations of interest: the first, when data come from a doubling 
measure, and the second, when the data are documents from a topic model. 



1 Introduction 



The problem of nearest neighbor search has engendered a vast body of algorithmic work. In the most basic 
formulation, there is a set S of n points, typically in a Euclidean space $\mathbb{R}^d$, and any subsequent query point
must be answered by its nearest neighbor (NN) in S. A simple solution is to store S as a list, and to address 
queries using a linear-time scan of the list. The challenge is to achieve a substantially smaller query time 
than this. 

We will consider a prototypical modern application in which the number of points n and the dimension 
d are both large. The primary resource constraints are the size of the data structure used to store S and the 
amount of time taken to answer queries. For practical purposes, the former must be $O(n)$, or maybe a little
more, and the latter must be o(n). Secondary constraints include the time to build the data structure and, 
sometimes, the time to add new points to S or to remove existing points from S. 

A major finding of the past two decades has been that these resource bounds can be met if it is enough 
to merely return a c-approximate nearest neighbor, whose distance from the query is at most c times that of
the true nearest neighbor. One such method that has been successful in practice is locality sensitive hashing
(LSH), which has space requirement $n^{1+\rho}$ and query time $O(n^{\rho})$, for $\rho \approx 1/c^2$ (Andoni and Indyk 2008).
Another such method is the balanced box decomposition tree, which takes $O(n)$ space and answers queries
with an approximation factor $c = 1 + \epsilon$ in $O((6/\epsilon)^d \log n)$ time (Arya et al. 1998).



In the latter result, an exponential dependence on dimension is evident, and indeed this is a familiar blot
on the nearest neighbor landscape. One way to mitigate the curse of dimensionality is to consider situations
in which data have low intrinsic dimension $d_o$, even if they happen to lie in $\mathbb{R}^d$ for $d \gg d_o$ or in a general
metric space. A common assumption is that the data are drawn from a doubling measure of dimension $d_o$ (or
equivalently, have expansion rate $2^{d_o}$); this is defined in Section 4.1 below. Under this condition, Karger and
Ruhl (2002) have a scheme that gives exact answers to nearest neighbor queries in time $O(2^{3d_o} \log n)$, using
a data structure of size $O(2^{3d_o} n)$. The more recent cover tree algorithm (Beygelzimer et al. 2006), which has
been used quite widely, creates a data structure in space $O(n)$ and answers queries in time $O(2^{d_o} \log n)$. There
is also work that combines intrinsic dimension and approximate search. The navigating net (Krauthgamer
and Lee 2004), given data from a metric space of doubling dimension $d_o$, has size $O(2^{O(d_o)} n)$ and gives a
$(1 + \epsilon)$-approximate answer to queries in time $O(2^{O(d_o)} \log n + (1/\epsilon)^{O(d_o)})$; the crucial advantage here is that
doubling dimension is a more general and robust notion than doubling measure.

Figure 1: Left: A k-d tree, with axis-parallel splits. Right: A variant in which the split directions are chosen
randomly from the unit sphere.

Despite these and many other results, there are two significant deficiencies in the nearest neighbor literature
that have motivated the present paper. First, existing analyses have succeeded at identifying, for a given
data structure, highly specific families of data for which efficient exact NN search is possible — for instance, 
data from doubling measures — but have failed to provide a more general characterization. Second, there 
remains a class of nearest neighbor data structures that are popular and successful in practice, but that have 
not been analyzed thoroughly. These structures combine classical k-d tree partitioning with randomization 
and overlapping cells, and are the subject of this paper. 



1.1 Three randomized tree structures for exact NN search 



The k-d tree is a partition of $\mathbb{R}^d$ into hyper-rectangular cells, based on a set of data points (Bentley 1975).
The root of the tree is a single cell corresponding to the entire space. A coordinate direction is chosen, and
the cell is split at the median of the data along this direction (Figure 1, left). The process is then recursed
on the two newly created cells, and continues until all leaf cells contain at most some predetermined number
$n_o$ of points. When there are $n$ data points, the depth of the tree is at most about $\log(n/n_o)$.

Given a k-d tree built from data points S, there are several ways to answer a nearest neighbor query q. 
The quickest and dirtiest of these is to move q down the tree to its appropriate leaf cell, and then return the 
nearest neighbor in that cell. This defeatist search takes time just $O(n_o + \log(n/n_o))$, which is $O(\log n)$ for
constant $n_o$. The problem is that q's nearest neighbor may well lie in a different cell, for instance when the
data happen to be concentrated near cell boundaries. Consequently, the failure probability of this scheme 
can be unacceptably high. 

Over the years, some simple tricks have emerged, from various sources, for reducing the failure probability.
These are nicely laid out by Liu et al. (2004), who show experimentally that the resulting algorithms are
effective in practice.

The first trick is to introduce randomness into the tree. Drawing inspiration from locality-sensitive
hashing, Liu et al. (2004) suggest preprocessing the data set S by randomly rotating it, and then applying a
k-d tree (or related tree structure). This is rather like splitting cells along random directions as opposed to
coordinate axes (Figure 1, right). In this paper, we consider a data structure that uses random split directions
as well as a second type of randomization: instead of putting the split point exactly at the median, it is
placed at a fractile chosen uniformly at random from the range [1/4, 3/4]. The resulting structure (Figure 2)
is almost exactly the random projection tree (or RP tree) of Dasgupta and Freund (2008). That earlier work
showed that in RP trees, the diameters of the cells decrease (down the tree) at a rate depending only on the
intrinsic dimension of the data. It is a curious result, but is not helpful in analyzing nearest neighbor search,
and in this paper we develop a different line of reasoning. Indeed, there is no point of contact between that
earlier analysis and the one we embark upon here.

Function MakeRPTree(S)
    If $|S| < n_o$: return leaf containing S
    Pick U uniformly at random from the unit sphere
    Pick $\beta$ uniformly at random from [1/4, 3/4]
    Let v be the $\beta$-fractile point on the projection of S onto U
    Rule(x) = (left if $x \cdot U < v$, otherwise right)
    LeftSubtree = MakeRPTree({x ∈ S : Rule(x) = left})
    RightSubtree = MakeRPTree({x ∈ S : Rule(x) = right})
    Return (Rule(·), LeftSubtree, RightSubtree)

Figure 2: The random projection tree (RP tree)
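The RP tree construction and the defeatist query can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `make_rp_tree`, `defeatist_query`, and `leaves` are hypothetical, points are plain tuples of floats, leaves are taken to hold at most `n0 >= 1` points, and ties among projections are ignored, as in the text.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_unit_vector(d, rng):
    # Normalizing a standard Gaussian vector gives a uniform direction on the sphere.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]

def make_rp_tree(points, n0, rng):
    """Return a leaf (list of points) or an internal node (u, v, left, right)."""
    if len(points) <= n0:  # assumption: a leaf holds at most n0 >= 1 points
        return list(points)
    u = random_unit_vector(len(points[0]), rng)
    beta = rng.uniform(0.25, 0.75)
    ordered = sorted(points, key=lambda p: dot(p, u))
    # Index of the beta-fractile, clamped so that both children are non-empty.
    cut = min(max(int(beta * len(ordered)), 1), len(ordered) - 1)
    v = dot(ordered[cut], u)  # split value: points with x.u < v go left
    return (u, v,
            make_rp_tree(ordered[:cut], n0, rng),
            make_rp_tree(ordered[cut:], n0, rng))

def defeatist_query(tree, q):
    """Route q to a single leaf and return the nearest point stored there."""
    node = tree
    while isinstance(node, tuple):
        u, v, left, right = node
        node = left if dot(q, u) < v else right
    return min(node, key=lambda p: math.dist(p, q))

def leaves(tree):
    """List all leaf cells (used here only to inspect the tree)."""
    if isinstance(tree, tuple):
        return leaves(tree[2]) + leaves(tree[3])
    return [tree]

rng = random.Random(0)
data = [tuple(rng.random() for _ in range(5)) for _ in range(200)]
tree = make_rp_tree(data, 10, rng)
answer = defeatist_query(tree, (0.5,) * 5)
```

The returned `answer` is the nearest neighbor within one leaf only, so it may differ from the true nearest neighbor; bounding how often that happens is the subject of the analysis.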



A second trick suggested by Liu et al. (2004) for reducing failure probability is to allow overlap between
cells. This was also proposed in earlier work of Maneewongvatana and Mount (2001). Once again, each cell
$C$ is split along a direction $U(C)$ chosen at random from the unit sphere. But now, three split points are
noted: the median $m(C)$ of the data along direction $U$, the $(1/2) - \alpha$ fractile value $l(C)$, and the $(1/2) + \alpha$
fractile value $r(C)$. Here $\alpha$ is a small constant, like 0.05 or 0.1. The idea is to simultaneously entertain a
median split

$$\text{left} = \{x : x \cdot U < m(C)\}, \qquad \text{right} = \{x : x \cdot U \ge m(C)\}$$

and an overlapping split (with the middle $2\alpha$ fraction of the data falling on both sides)

$$\text{left} = \{x : x \cdot U < r(C)\}, \qquad \text{right} = \{x : x \cdot U \ge l(C)\}.$$

In the spill tree (Liu et al. 2004), each data point in S is stored in multiple leaves, by following the overlapping
splits. A query is then answered defeatist-style, by routing it to a single leaf using median splits.

Both the RP tree and the spill tree have query times of $O(n_o + \log(n/n_o))$, but the latter can be expected
to have a lower failure probability, and we will see this in the bounds we obtain. On the other hand, the
RP tree requires just linear space, while the size of the spill tree is $O(n^{1/(1 - \lg(1+2\alpha))})$. When $\alpha = 0.05$, for
instance, the size is $O(n^{1.159})$.
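As a quick arithmetic check of the spill tree's size exponent, the following sketch evaluates $1/(1 - \lg(1 + 2\alpha))$ for a given overlap parameter. The function name `spill_tree_size_exponent` is a hypothetical helper introduced only for this illustration.

```python
import math

def spill_tree_size_exponent(alpha):
    """Exponent e for which the spill tree has size O(n^e): e = 1 / (1 - lg(1 + 2*alpha))."""
    return 1.0 / (1.0 - math.log2(1.0 + 2.0 * alpha))

exponent_005 = spill_tree_size_exponent(0.05)
exponent_010 = spill_tree_size_exponent(0.10)
```

The exponent grows with $\alpha$, which makes the space cost of additional overlap explicit.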

In view of these tradeoffs, we consider a further variant, which we call the virtual spill tree. It stores each 
data point in a single leaf, following median splits, and hence has linear size. However, each query is routed 
to multiple leaves, using overlapping splits, and the return value is its nearest neighbor in the union of these 
leaves. 

The various splits are summarized in Figure 3, and the three trees use them as follows:

                        Routing data         Routing queries
    RP tree             Perturbed split      Perturbed split
    Spill tree          Overlapping split    Median split
    Virtual spill tree  Median split         Overlapping split



One small technicality: if, for instance, there are duplicates among the data points, it might not be 
possible to achieve a median split, or a split at a desired fractile. We will ignore these discretization 
problems. 



1.2 Analysis of failure probability

Our three schemes for nearest neighbor search (the RP tree and the two spill trees) can be analyzed in a
simple and unified framework. Pick any data set $x_1, \ldots, x_n \in \mathbb{R}^d$ and any query $q \in \mathbb{R}^d$. The probability of
failure, of not finding the nearest neighbor, can be shown to be directly related to the quantity

$$\Phi(q, \{x_1, \ldots, x_n\}) = \frac{1}{n} \sum_{i=2}^{n} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|},$$

where $x_{(1)}, x_{(2)}, \ldots$ denotes an ordering of the $x_i$ by increasing distance from $q$. For RP trees, the failure
probability is proportional to $\Phi \log(1/\Phi)$ (Theorem 7); for the two spill trees, it is proportional to $\Phi$
(Theorem 6). The results extend easily to the problem of searching for the k nearest neighbors. Moreover, these
bounds are roughly tight: a failure probability proportional to $\Phi$ is inevitable unless there is a significant
amount of collinearity within the data (Corollary 2).

[Figure 3 panels: median split (masses 1/2 and 1/2); perturbed split (masses $\beta$ and $1 - \beta$); overlapping split (mass $1/2 + \alpha$ on each side)]

Figure 3: Three types of split. The fractions refer to probability mass. $\alpha$ is some constant, while $\beta$ is chosen
uniformly at random from [1/4, 3/4].

Let's take a closer look at this potential function. If $\Phi$ is close to 1, then all the points are roughly the
same distance from q, and so we can expect that the NN query is not easy to answer. On the other hand,
if $\Phi$ is close to zero, then most of the points are much further away than the nearest neighbor, so the latter
should be easy to identify. Thus the potential function is an intuitively reasonable measure of the difficulty
of NN search.
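To make this intuition concrete, here is a minimal sketch that evaluates the potential function, taken to be $\frac{1}{n}\sum_{i \ge 2} \|q - x_{(1)}\| / \|q - x_{(i)}\|$, on a hard and an easy configuration. The function name `potential` and the two toy data sets are illustrative assumptions; the sketch also assumes that q does not coincide with any data point.

```python
import math

def potential(q, points):
    """Phi(q, {x_1..x_n}): average of ||q - x_(1)|| / ||q - x_(i)|| over i >= 2, divided by n."""
    dists = sorted(math.dist(q, x) for x in points)
    nearest = dists[0]
    return sum(nearest / d for d in dists[1:]) / len(dists)

q = (0.0, 0.0)
# All points at nearly the same distance from q: a hard instance.
hard = [(1.0, 0.0), (0.0, 1.01), (-1.02, 0.0), (0.0, -1.03)]
# One close point and the rest far away: an easy instance.
easy = [(0.01, 0.0), (0.0, 10.0), (-10.0, 0.0), (0.0, -10.0)]

phi_hard = potential(q, hard)
phi_easy = potential(q, easy)
```

With only four points the potential cannot exceed 3/4, so `phi_hard` is close to its largest possible value, while `phi_easy` is close to zero.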

This general characterization of data configurations amenable to efficient exact NN search, by the three 
data structures, is our main result. Earlier work has looked at other data structures, and has only provided 
guarantees for very specific families of data. To illustrate our theorem, we bound $\Phi$ for two commonly studied
data types. In either scenario, the queries are arbitrary. 



• When $x_1, \ldots, x_n$ are drawn i.i.d. from a doubling measure (Section 4.1). As we discussed earlier, this
is the assumption under which many other results for exact NN search have been obtained.

• When $x_1, \ldots, x_n$ are documents drawn from a topic model (Section 4.2).



For doubling measures of intrinsic dimension $d_o$, we show that the spill tree is able to answer exact nearest
neighbor queries in time $O(d_o)^{d_o} + O(\log n)$, with a probability of error that is an arbitrarily small constant,
while the RP tree is slower by only a logarithmic factor (Theorem 10). These are close to the best results
that have been obtained using other data structures. (The failure probability is over the randomization in
the tree structure, and can be further reduced by building multiple trees.) We chose the topic model as an
example of a significantly harder case: its data distribution is more concentrated, in the sense that there are
a lot of data points that are only slightly further away than the nearest neighbor. The resulting savings are
far more modest though non-negligible: for large n, the time to answer a query is roughly $n \cdot 2^{-O(\sqrt{L})}$, where
L is the expected document length.

In some situations, the time to construct the data structure, and the ability to later add or remove data 
points, are significant factors. It is readily seen that the construction time for the spill tree is proportional 
to its size, while that of the RP tree and the virtual spill tree is $O(n \log n)$. Adding and removing points is also
easy: all guarantees hold if these are performed locally, while rebuilding the entire data structure after every
$O(n)$ such operations.
2 A potential function for point configurations



To motivate the potential function $\Phi$, we start by considering what happens when there are just two data
points and one query point. 

2.1 How random projection affects the relative placement of three points 

Consider any three points $q, x, y \in \mathbb{R}^d$, such that $x$ is closer to $q$ than is $y$; that is, $\|q - x\| \le \|q - y\|$.

Now suppose that a random direction $U$ is chosen from the unit sphere $S^{d-1}$, and that the points are
projected onto this direction. What is the probability that $y$ falls between $q$ and $x$ on this line? The
following lemma answers this question exactly. An approximate solution, with different proof method, was
given earlier by Kleinberg (1997).

Lemma 1 Pick any $q, x, y \in \mathbb{R}^d$ with $\|q - x\| \le \|q - y\|$. Pick a random unit direction $U$. Then

$$\Pr_U(y \cdot U \text{ falls (strictly) between } q \cdot U \text{ and } x \cdot U) = \frac{1}{\pi} \arcsin\left( \frac{\|q - x\|}{\|q - y\|} \sqrt{1 - \left( \frac{(q - x) \cdot (y - x)}{\|q - x\| \, \|y - x\|} \right)^2} \right).$$



Proof: We may assume that $U$ is drawn from $N(0, I_d)$, the d-dimensional Gaussian with mean zero and
unit covariance. This gives exactly the right distribution if we scale $U$ to unit length, but we can skip this
last step since it has no effect on the question we are considering.

We can also assume, without loss of generality, that $q$ lies at the origin and that $x$ lies along the (positive)
$x_1$-axis: that is, $q = 0$ and $x = \|x\| e_1$. It will then be helpful to split the direction $U$ into two pieces,
its component $U_1$ in the $x_1$-direction, and the remaining $d - 1$ coordinates $U_R$. Likewise, we will write
$y = (y_1, y_R)$.

If $y_R = 0$ then $x$, $y$, and $q$ are collinear, and the projection of $y$ cannot possibly fall between those of $x$
and $q$. In what follows, we assume $y_R \ne 0$.

Let $E$ denote the event of interest:

$$\begin{aligned}
E &\equiv y \cdot U \text{ falls between } q \cdot U \text{ (that is, 0) and } x \cdot U \text{ (that is, } \|x\| U_1) \\
  &\equiv y_R \cdot U_R \text{ falls between } -y_1 U_1 \text{ and } (\|x\| - y_1) U_1.
\end{aligned}$$

The interval of interest is either $(-y_1 |U_1|, (\|x\| - y_1)|U_1|)$, if $U_1 \ge 0$, or $(-(\|x\| - y_1)|U_1|, y_1 |U_1|)$, if $U_1 < 0$.
To simplify things, $y_R \cdot U_R$ is independent of $U_1$ and is distributed as $N(0, \|y_R\|^2)$, which is symmetric and
thus assigns the same probability mass to the two intervals. We can therefore write

$$\Pr_U(E) = \Pr_{U_1} \Pr_{U_R}\left( -y_1 |U_1| < y_R \cdot U_R < (\|x\| - y_1)|U_1| \right).$$

Let $Z$ and $Z'$ be independent standard normals $N(0, 1)$. Since $U_1$ is distributed as $Z$ and $y_R \cdot U_R$ is distributed
as $\|y_R\| Z'$,

$$\Pr_U(E) = \Pr\left( -y_1 |Z| < \|y_R\| Z' < (\|x\| - y_1)|Z| \right) = \Pr\left( \frac{Z'}{|Z|} \in \left( -\frac{y_1}{\|y_R\|}, \frac{\|x\| - y_1}{\|y_R\|} \right) \right).$$

Now $Z'/|Z|$ is the ratio of two standard normals, which has a standard Cauchy distribution. Using the
formula for a Cauchy density,

$$\begin{aligned}
\Pr(E) &= \int_{-y_1/\|y_R\|}^{(\|x\| - y_1)/\|y_R\|} \frac{dw}{\pi(1 + w^2)} \\
&= \frac{1}{\pi} \left( \arctan\left( \frac{\|x\| - y_1}{\|y_R\|} \right) - \arctan\left( -\frac{y_1}{\|y_R\|} \right) \right) \\
&= \frac{1}{\pi} \arctan\left( \frac{\|x\| \, \|y_R\|}{\|y\|^2 - y_1 \|x\|} \right) \\
&= \frac{1}{\pi} \arcsin\left( \frac{\|x\|}{\|y\|} \sqrt{ \frac{\|y\|^2 - y_1^2}{\|y\|^2 + \|x\|^2 - 2 y_1 \|x\|} } \right),
\end{aligned}$$

which is exactly the expression in the lemma statement once we invoke $y_1 = (y \cdot x)/\|x\|$ and factor in our
assumption that $q = 0$. □
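The closed form of Lemma 1 can be sanity-checked numerically. The sketch below compares the arcsin expression with a Monte Carlo estimate obtained by projecting onto random Gaussian directions (whose scale does not affect the ordering of the projections). The function names, the example points, the seed, and the number of trials are all illustrative assumptions, not part of the original analysis.

```python
import math
import random

def lemma1_probability(q, x, y):
    """Closed form from Lemma 1; assumes ||q - x|| <= ||q - y|| and x != q, x != y."""
    qx = [a - b for a, b in zip(q, x)]
    yx = [a - b for a, b in zip(y, x)]
    norm_qx = math.hypot(*qx)
    norm_yx = math.hypot(*yx)
    cos_angle = sum(a * b for a, b in zip(qx, yx)) / (norm_qx * norm_yx)
    s = (norm_qx / math.dist(q, y)) * math.sqrt(max(0.0, 1.0 - cos_angle ** 2))
    return math.asin(min(1.0, s)) / math.pi

def simulated_probability(q, x, y, trials=100_000, seed=0):
    """Fraction of random directions along which y projects strictly between q and x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(len(q))]
        pq = sum(a * b for a, b in zip(q, u))
        px = sum(a * b for a, b in zip(x, u))
        py = sum(a * b for a, b in zip(y, u))
        if min(pq, px) < py < max(pq, px):
            hits += 1
    return hits / trials

q = (0.0, 0.0, 0.0)
x = (1.0, 0.0, 0.0)
y = (0.5, 1.5, 0.5)
exact = lemma1_probability(q, x, y)
estimate = simulated_probability(q, x, y)
```

For these three points the closed form is about 0.195, and the simulated frequency should agree with it up to sampling error.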

To simplify the expression, define an index of the collinearity of $q, x, y$ to be

$$\mathrm{coll}(q, x, y) = \frac{|(q - x) \cdot (y - x)|}{\|q - x\| \, \|y - x\|}.$$

This value, in the range [0, 1], is 1 when the points are collinear, and 0 when $q - x$ is orthogonal to $x - y$.

Corollary 2 Under the conditions of Lemma 1,

$$\frac{1}{\pi} \frac{\|q - x\|}{\|q - y\|} \sqrt{1 - \mathrm{coll}(q, x, y)^2} \;\le\; \Pr_U(y \cdot U \text{ falls between } q \cdot U \text{ and } x \cdot U) \;\le\; \frac{1}{2} \frac{\|q - x\|}{\|q - y\|}.$$

Proof: Apply the inequality $\theta \ge \sin\theta \ge 2\theta/\pi$ for all $0 \le \theta \le \pi/2$. □

The upper and lower bounds of Corollary 2 are within a constant factor of each other unless the points
are approximately collinear.

2.2 By how much does random projection separate nearest neighbors? 

For a query $q$ and data points $x_1, \ldots, x_n$, let $x_{(1)}, x_{(2)}, \ldots$ denote a re-ordering of the points by increasing
distance from $q$. Consider the potential function

$$\Phi(q, \{x_1, \ldots, x_n\}) = \frac{1}{n} \sum_{i=2}^{n} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}.$$

Theorem 3 Pick any points $q, x_1, \ldots, x_n \in \mathbb{R}^d$. If these points are projected to a direction $U$ chosen at
random from the unit sphere, then

$$E_U(\text{fraction of the projected } x_i \text{ that fall between } q \text{ and } x_{(1)}) \le \frac{1}{2} \Phi(q, \{x_1, \ldots, x_n\}).$$

Proof: Let $Z_i$ be the event that $x_{(i)}$ falls between $q$ and $x_{(1)}$ in the projection. By Corollary 2,

$$\Pr_U(Z_i) \le \frac{1}{2} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}.$$

The theorem now follows by linearity of expectation. □



The upper bound of Theorem 3 is fairly tight, as can be seen from Corollary 2, unless there is a high
degree of collinearity between the points.

In the tree data structures we analyze, most cells contain only a subset of the data $\{x_1, \ldots, x_n\}$. For a
cell that contains $m$ of these points, the appropriate variant of $\Phi$ is

$$\Phi_m(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=2}^{m} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}.$$

Corollary 4 Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote any subset of the $x_i$ that includes $x_{(1)}$. If $q$ and
the points in $S$ are projected to a direction $U$ chosen at random from the unit sphere, then for any $0 < \alpha < 1$,

$$\Pr_U(\text{at least an } \alpha \text{ fraction of } S \text{ falls between } q \text{ and } x_{(1)} \text{ when projected}) \le \frac{1}{2\alpha} \Phi_{|S|}(q, \{x_1, \ldots, x_n\}).$$

Proof: This follows immediately by applying Theorem 3 to $S$, noting that the corresponding value of $\Phi$ is
maximized when $S$ consists of the points closest to $q$, and then applying Markov's inequality. □

2.3 Extension to k nearest neighbors 

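If we are interested in finding the k nearest neighbors, a suitable generalization of $\Phi_m$ is

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=k+1}^{m} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|}.$$

Theorem 5 Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote any subset of the $x_i$ that includes $x_{(1)}, \ldots, x_{(k)}$.
Suppose $q$ and the points in $S$ are projected to a direction $U$ chosen at random from the unit sphere. Then,
for any $0 < \alpha < 1$, the probability (over $U$) that in the projection, there is some $1 \le j \le k$ for which at least $\alpha |S|$
points lie between $x_{(j)}$ and $q$ is at most

$$\frac{k}{2(\alpha - (k-1)/|S|)} \Phi_{k,|S|}(q, \{x_1, \ldots, x_n\}),$$

provided $k < \alpha |S| + 1$.

Proof: Set $m = |S|$. As in Corollary 4, the probability of the bad event is maximized when $S =
\{x_{(1)}, \ldots, x_{(m)}\}$, so we will assume as much.

For any $1 \le j \le k$, let $N_j$ denote the number of points in $\{x_{(k+1)}, \ldots, x_{(m)}\}$ that fall (strictly) between
$q$ and $x_{(j)}$ in the projection. Reasoning as in Theorem 3, we have

$$\Pr_U(N_j \ge \alpha m - (k-1)) \le \frac{E_U N_j}{\alpha m - (k-1)} \le \frac{1}{2(\alpha m - (k-1))} \sum_{i=k+1}^{m} \frac{\|q - x_{(j)}\|}{\|q - x_{(i)}\|}.$$

Taking a union bound over all $1 \le j \le k$,

$$\begin{aligned}
\Pr_U(\exists\, 1 \le j \le k : N_j \ge \alpha m - (k-1)) &\le \frac{1}{2(\alpha m - (k-1))} \sum_{i=k+1}^{m} \frac{\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|}{\|q - x_{(i)}\|} \\
&= \frac{k}{2(\alpha - (k-1)/m)} \Phi_{k,m}(q, \{x_1, \ldots, x_n\}),
\end{aligned}$$

as claimed. □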
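The k-nearest-neighbor potential and the resulting bound are easy to evaluate directly. The sketch below assumes the definition $\Phi_{k,m} = \frac{1}{m}\sum_{i=k+1}^{m} \big((\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k\big) / \|q - x_{(i)}\|$ and a bound of the form $\frac{k}{2(\alpha - (k-1)/m)} \Phi_{k,m}$; the names `potential_k_m` and `theorem5_bound` and the four example points are hypothetical.

```python
import math

def potential_k_m(q, points, k, m):
    """Phi_{k,m}: uses only the m points closest to q; k = 1 recovers Phi_m."""
    dists = sorted(math.dist(q, x) for x in points)[:m]
    mean_knn = sum(dists[:k]) / k
    return sum(mean_knn / d for d in dists[k:]) / m

def theorem5_bound(q, points, k, alpha):
    """Bound on the chance that some x_(j), j <= k, is separated from q by >= alpha*m points."""
    m = len(points)
    if not k < alpha * m + 1:
        raise ValueError("the bound requires k < alpha * m + 1")
    return k / (2.0 * (alpha - (k - 1) / m)) * potential_k_m(q, points, k, m)

q = (0.0, 0.0)
points = [(1.0, 0.0), (0.0, 2.0), (-4.0, 0.0), (0.0, -8.0)]  # distances 1, 2, 4, 8
phi_1 = potential_k_m(q, points, k=1, m=4)   # (1/4)(1/2 + 1/4 + 1/8)
phi_2 = potential_k_m(q, points, k=2, m=4)   # (1/4)(1.5/4 + 1.5/8)
bound = theorem5_bound(q, points, k=1, alpha=0.5)
```

With `k=1` and `alpha=0.5` the prefactor is exactly 1, so `bound` coincides with `phi_1`.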
2.4 Bounds on $\Phi$

The results so far suggest that $\Phi$ is closely related to the failure probabilities of the randomized search trees
we have described. In the next section, we will make this relationship precise. We will then give bounds on
$\Phi$ for various types of data. Here is a brief preview: for large enough $m$, very roughly

$$\Phi_m \lesssim \begin{cases} (1/m)^{1/d_o} & \text{doubling measure of intrinsic dimension } d_o \\ 1/\sqrt{L} & \text{topic model with expected document length } L \end{cases}$$

3 Randomized partition trees 

We'll now see that the failure probability of the random projection tree is proportional to $\Phi \ln(1/\Phi)$, while
that of the two spill trees is proportional to $\Phi$. We start with the second result, since it is the more
straightforward of the two.

3.1 Randomized spill trees 

In a randomized spill tree, each cell is split along a direction chosen uniformly at random from the unit
sphere. Two kinds of splits are simultaneously considered: (1) a split at the median (along the random
direction), and (2) an overlapping split with one part containing the bottom $1/2 + \alpha$ fraction of the cell's
points, and the other part containing the top $1/2 + \alpha$ fraction, where $0 < \alpha < 1/2$ (recall Figure 3).

We consider two data structures that use these splits in different ways. The spill tree stores each data
point in (possibly) multiple leaves, using overlapping splits. The tree is grown until each leaf contains at
most $n_o$ points. A query is answered by routing it to a single leaf, using median splits, and returning the
NN in that leaf.

The time to answer a query is just $O(n_o + \log(n/n_o))$, but the space requirement of this data structure
is super-linear. Its depth is $\ell = \log_{1/\beta}(n/n_o)$ levels, where $\beta = (1/2) + \alpha$, and thus the total size is

$$n_o 2^{\ell} = n_o \left( \frac{n}{n_o} \right)^{\log_{1/\beta} 2}.$$

We will take $n_o$ to be a constant independent of $n$, so this size is $O(n^{\log_{1/\beta} 2})$. When $\alpha = 0.05$, for instance,
the size is $O(n^{1.159})$. When $\alpha = 0.1$, it is $O(n^{1.357})$.

A virtual spill tree stores each data point in a single leaf, using median splits, once again growing the tree
until each leaf has $n_o$ or fewer points. Thus the total size is just $O(n)$ and the depth is $\log_2(n/n_o)$. However,
a query is answered by routing it to multiple leaves using overlapping splits, and then returning the NN in
the union of these leaves.
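A virtual spill tree can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `build_virtual_spill_tree` and `virtual_spill_query` are hypothetical, points are tuples of floats, leaves hold at most `n0 >= 1` points, `alpha` lies strictly between 0 and 1/2, and ties among projections are ignored. Data are routed by median splits, while each node also records the two fractile values used to route queries through the overlapping split.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]

def build_virtual_spill_tree(points, n0, alpha, rng):
    """Leaf: list of points. Internal node: (u, lo, hi, left, right)."""
    if len(points) <= n0:
        return list(points)
    u = random_unit_vector(len(points[0]), rng)
    ordered = sorted(points, key=lambda p: dot(p, u))
    m = len(ordered)
    lo = dot(ordered[int((0.5 - alpha) * m)], u)              # (1/2 - alpha) fractile
    hi = dot(ordered[min(int((0.5 + alpha) * m), m - 1)], u)  # (1/2 + alpha) fractile
    mid = m // 2                                              # data follow the median split
    return (u, lo, hi,
            build_virtual_spill_tree(ordered[:mid], n0, alpha, rng),
            build_virtual_spill_tree(ordered[mid:], n0, alpha, rng))

def virtual_spill_query(tree, q):
    """Route q by overlapping splits and return the nearest point in all leaves reached."""
    candidates, stack = [], [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            candidates.extend(node)
            continue
        u, lo, hi, left, right = node
        proj = dot(q, u)
        if proj < hi:       # overlapping split, left side: projection below the upper fractile
            stack.append(left)
        if proj >= lo:      # overlapping split, right side: projection at or above the lower fractile
            stack.append(right)
    return min(candidates, key=lambda p: math.dist(p, q))

rng = random.Random(0)
data = [tuple(rng.random() for _ in range(5)) for _ in range(300)]
tree = build_virtual_spill_tree(data, n0=10, alpha=0.1, rng=rng)
answer = virtual_spill_query(tree, (0.5,) * 5)
```

A regular spill tree would instead duplicate the points in the overlap region into both children at build time and route each query by the median alone, trading space for query time.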

Theorem 6 Suppose a randomized spill tree is built using data points $\{x_1, \ldots, x_n\}$, to depth $\ell = \log_{1/\beta}(n/n_o)$,
where $\beta = (1/2) + \alpha$ for regular spill trees and $\beta = 1/2$ for virtual spill trees. If this tree is used to answer a
query $q$, then the probability (over randomization in the construction of the tree) that it fails to return $x_{(1)}$
is at most

$$\frac{1}{2\alpha} \sum_{i=0}^{\ell} \Phi_{\beta^i n}(q, \{x_1, \ldots, x_n\}).$$

The probability that it fails to return the $k \ge 1$ nearest neighbors $x_{(1)}, \ldots, x_{(k)}$ is at most

$$\frac{k}{\alpha} \sum_{i=0}^{\ell} \Phi_{k, \beta^i n}(q, \{x_1, \ldots, x_n\}),$$

provided $k \le \alpha n_o/2$.
Proof: Let's start with the regular spill tree. Consider the internal node at depth $i$ on the root-to-leaf
path of query $q$; this node contains $\beta^i n$ data points, for $\beta = (1/2) + \alpha$. What is the probability that $q$ gets
separated from $x_{(1)}$ when the node is split? This bad event can only happen if $q$ and $x_{(1)}$ lie on opposite
sides of the median and if $x_{(1)}$ is transmitted only to one side of the split, that is, if at least an $\alpha$ fraction of
the points lie between $x_{(1)}$ and the median. This means that at least an $\alpha$ fraction of the cell's projected
points must fall between $q$ and $x_{(1)}$, which occurs with probability at most $(1/(2\alpha)) \Phi_{\beta^i n}(q, \{x_1, \ldots, x_n\})$ by
Corollary 4. The theorem follows by summing over all levels $i$.

The argument for the virtual spill tree is identical, except that we use $\beta = 1/2$ and we swap the roles of
$q$ and $x_{(1)}$; for instance, we consider the root-to-leaf path of $x_{(1)}$.

The generalization to k nearest neighbors is immediate for spill trees. The probability of something going
wrong at level $i$ of the tree is, by Theorem 5, at most

$$\frac{k}{2(\alpha - (k-1)/n_o)} \Phi_{k, \beta^i n} \le \frac{k}{\alpha} \Phi_{k, \beta^i n}.$$

Virtual spill trees require a slightly more careful argument. If the root-to-leaf path of each $x_{(j)}$, for $1 \le j \le k$,
is considered separately, it can be shown that the total probability of failure at level $i$ is again bounded by
the same expression. □

As we mentioned earlier, we will encounter two functional forms of $\Phi_m$: either $1/m^{1/d_o}$, where $d_o$ is
a notion of intrinsic dimension, or a small constant $1/\sqrt{L}$. In the former case, the failure probability of
the spill tree is roughly $1/(\alpha n_o^{1/d_o})$, and in the latter case it is $(1/(\alpha \sqrt{L})) \log(n/n_o)$. Further details are in
Sections 4.1 and 4.2.



3.2 Random projection trees 

In an RP tree, a cell is split by choosing a direction uniformly at random from the unit sphere, projecting
the points in the cell onto that direction, and then splitting at the $\beta$ fractile, for $\beta$ chosen uniformly at random
from [1/4, 3/4]. As in a k-d tree, each point is mapped to a single leaf. Likewise, a query point is routed to
a particular leaf, and its nearest neighbor within that leaf is returned.

In many of the statements below, we will drop the arguments $(q, \{x_1, \ldots, x_n\})$ of $\Phi$ in the interest of
readability.

Theorem 7 Suppose an RP tree is built using points $\{x_1, \ldots, x_n\}$ and is then used to answer a query $q$.
The probability (over the randomization in tree construction) that it fails to return the nearest neighbor of $q$
is at most

$$\sum_{i=0}^{\ell} \Phi_{\beta^i n} \ln \frac{2e}{\Phi_{\beta^i n}},$$

where $\beta = 3/4$ and $\ell = \log_{1/\beta}(n/n_o)$. The probability that it fails to return the k nearest neighbors of $q$ is at
most

$$2k \sum_{i=0}^{\ell} \Phi_{k, \beta^i n} \ln \frac{2e}{k \Phi_{k, \beta^i n}} + \frac{16(k-1)}{n_o}.$$



Proof: Consider any internal node of the tree that contains $q$ as well as $m$ of the data points, including
$x_{(1)}$. What is the probability that the split at that node separates $q$ from $x_{(1)}$? To analyze this, let $F$ denote
the fraction of the $m$ points that fall between $q$ and $x_{(1)}$ along the randomly-chosen split direction. Since
the split point is chosen at random from an interval of mass 1/2, the probability that it separates $q$ from
$x_{(1)}$ is at most $F/(1/2)$. Integrating out $F$, we get

$$\begin{aligned}
\Pr(q \text{ is separated from } x_{(1)}) &\le \int_0^1 \Pr(F = f) \, \frac{f}{1/2} \, df \\
&= 2 \int_0^1 \Pr(F > f) \, df \\
&\le 2 \int_0^1 \min\left( 1, \frac{\Phi_m}{2f} \right) df \\
&= 2 \int_0^{\Phi_m/2} df + 2 \int_{\Phi_m/2}^1 \frac{\Phi_m}{2f} \, df = \Phi_m \ln \frac{2e}{\Phi_m},
\end{aligned}$$

where the second inequality uses Corollary 4.

The theorem follows by taking a union bound over the path that conveys $q$ from root to leaf, in which the
number of data points per level shrinks geometrically, by a factor of 3/4 or better.

The same reasoning generalizes to k nearest neighbors. This time, $F$ is defined to be the fraction of the
$m$ points that lie between $q$ and the furthest of $x_{(1)}, \ldots, x_{(k)}$ along the random splitting direction. Then $q$
is separated from one of these neighbors only if the split point lies in an interval of mass $F$ on either side of
$q$, an event that occurs with probability at most $2F/(1/2)$. Using Theorem 5,

$$\begin{aligned}
\Pr(q \text{ is separated from some } x_{(j)}, 1 \le j \le k) &\le 4 \int_0^1 \Pr(F > f) \, df \\
&\le 4 \int_0^1 \min\left( 1, \frac{k \Phi_{k,m}}{2(f - (k-1)/m)} \right) df \\
&\le 4 \int_0^{(k \Phi_{k,m}/2) + (k-1)/m} df + 4 \int_{(k \Phi_{k,m}/2) + (k-1)/m}^1 \frac{k \Phi_{k,m}}{2(f - (k-1)/m)} \, df \\
&\le 2k \Phi_{k,m} \ln \frac{2e}{k \Phi_{k,m}} + \frac{4(k-1)}{m},
\end{aligned}$$

and as before, we sum this over a root-to-leaf path in the tree. □
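The per-level integral in the single-nearest-neighbor case, $2\int_0^1 \min(1, \Phi_m/(2f))\,df = \Phi_m \ln(2e/\Phi_m)$, can be checked numerically. The following sketch uses a plain midpoint rule; the function names and the step count are illustrative assumptions, and the potential is taken to lie in (0, 1].

```python
import math

def per_level_bound(phi):
    """Closed form of the per-level bound: Phi_m * ln(2e / Phi_m)."""
    return phi * math.log(2.0 * math.e / phi)

def per_level_integral(phi, steps=200_000):
    """Midpoint-rule value of 2 * integral over [0, 1] of min(1, phi / (2 f)) df."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        f = (i + 0.5) * h
        total += min(1.0, phi / (2.0 * f))
    return 2.0 * total * h

closed_form = per_level_bound(0.1)
numeric = per_level_integral(0.1)
```

For a potential of 0.1 both quantities are close to 0.4, which illustrates the logarithmic overhead of the RP tree bound relative to the spill tree bound.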

3.3 Is randomization necessary? 

The tree data structures we have studied make crucial use of random projection for splitting cells. It would
not suffice to use coordinate directions, as in k-d trees.

To see this, consider a simple example. Let $q$, the query point, be the origin, and suppose the data points
$x_1, \ldots, x_n \in \mathbb{R}^d$ are chosen as follows:

• $x_1$ is the all-ones vector.

• Each $x_i$, $i > 1$, is chosen by picking a coordinate at random, setting its value to $M$, and then setting
all remaining coordinates to uniform-random numbers in the range (0, 1). Here $M$ is some very large
constant.

For large enough $M$, the nearest neighbor of $q$ is $x_1$. By letting $M$ grow further, we can let $\Phi(q, \{x_1, \ldots, x_n\})$
get arbitrarily close to zero, which means that our random projection methods will work admirably. However,
any coordinate projection will create a disastrously large separation between $q$ and $x_1$: on average, a $(1 - 1/d)$
fraction of the data points will fall between them.
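This example is easy to simulate. The sketch below builds the configuration just described and compares the fraction of points that project between $q$ and $x_1$ along a coordinate axis with the average fraction along random directions. The parameter values (n = 500, d = 20, M = 1000), the seed, and all function names are assumptions chosen only for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def potential(q, points):
    dists = sorted(math.dist(q, x) for x in points)
    return sum(dists[0] / d for d in dists[1:]) / len(dists)

def make_example(n, d, big, rng):
    points = [(1.0,) * d]                        # x_1: the all-ones vector
    for _ in range(n - 1):
        x = [rng.random() for _ in range(d)]     # uniform in [0, 1)
        x[rng.randrange(d)] = big                # one random coordinate set to M
        points.append(tuple(x))
    return points

def fraction_between(points, direction):
    """Fraction of the data projecting strictly between q = 0 and x_1 (stored first)."""
    p1 = dot(points[0], direction)
    lo, hi = min(0.0, p1), max(0.0, p1)
    return sum(lo < dot(x, direction) < hi for x in points[1:]) / len(points)

rng = random.Random(0)
n, d, big = 500, 20, 1000.0
points = make_example(n, d, big, rng)
q = (0.0,) * d

phi = potential(q, points)
axis = [1.0] + [0.0] * (d - 1)                   # a coordinate direction
axis_fraction = fraction_between(points, axis)

random_fractions = []
for _ in range(200):
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]  # scale does not affect the ordering
    random_fractions.append(fraction_between(points, u))
random_fraction = sum(random_fractions) / len(random_fractions)
```

In this setting the coordinate projection places roughly a $(1 - 1/d)$ fraction of the data between $q$ and $x_1$, whereas the random projections separate them by only a tiny fraction on average, in line with the small value of the potential.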
4 Bounding $\Phi$

The exact nearest neighbor schemes we analyze have error probabilities related to $\Phi$, which lies in the range
[0, 1]. The worst case is when all points are equidistant, in which case $\Phi$ is exactly 1, but this is a pathological
situation. Is it possible to bound $\Phi$ under simple assumptions on the data?

In this section we study two such assumptions. In each case, query points are arbitrary, but the data are 
assumed to have been drawn i.i.d. from an underlying distribution. 

4.1 Data drawn from a doubling measure 

Suppose the data points are drawn from a distribution $\mu$ on $\mathbb{R}^d$ which is a doubling measure: that is, there
exist a constant $C > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that

$$\mu(B(x, 2r)) \le C \cdot \mu(B(x, r)) \quad \text{for all } x \in X \text{ and all } r > 0.$$

Here $B(x, r)$ is the closed Euclidean ball of radius $r$ centered at $x$. To understand this condition, it is helpful
to also look at an alternative formulation that is essentially equivalent: there exist a constant $d_o > 0$ and a
subset $X \subseteq \mathbb{R}^d$ such that for all $x \in X$, all $r > 0$, and all $\alpha \ge 1$,

$$\mu(B(x, \alpha r)) \le \alpha^{d_o} \cdot \mu(B(x, r)).$$

In other words, the probability mass of a ball grows polynomially in the radius. Comparing this to the
standard formula for the volume of a ball, we see that the degree of this polynomial, $d_o$ (which is $\log_2 C$),
can reasonably be thought of as the "dimension" of the measure $\mu$.

Theorem 8 Suppose $\mu$ is continuous on $\mathbb{R}^d$ and is a doubling measure with dimension $d_o \ge 2$. Pick any
$q \in X$ and draw $x_1, \ldots, x_n$ independently at random from $\mu$. Pick any $0 < \delta < 1/2$. Then with probability
at least $1 - 3\delta$ over the choice of the $x_i$, for all $2 \le m \le n$,

$$\Phi_m(q, \{x_1, \ldots, x_n\}) \le 6 \left( \frac{2}{m} \ln \frac{1}{\delta} \right)^{1/d_o}.$$

Proof: We will consider a collection of balls $B_0, B_1, B_2, \ldots$ centered at $q$, with geometrically increasing
radii $r_0, r_1, r_2, \ldots$, respectively. For $i \ge 1$, we will take $r_i = 2^i r_0$. Thus by the doubling condition, $\mu(B_i) \le
C^i \mu(B_0)$, where $C = 2^{d_o} \ge 4$.

Define $r_0$ to be the radius for which $\mu(B(q, r_0)) = (1/n) \ln(1/\delta)$. This choice implies that $x_{(1)}$ is likely to
fall in $B_0$: when points $X = \{x_1, \ldots, x_n\}$ are drawn randomly from $\mu$,

$$\Pr(\text{no point falls in } B_0) = (1 - \mu(B_0))^n \le \delta.$$

Next, for $i \ge 1$, the expected number of points falling in ball $B_i$ is at most $n C^i \mu(B_0) = C^i \ln(1/\delta)$, and by
a multiplicative Chernoff bound,

$$\Pr(|X \cap B_i| \ge 2 n C^i \mu(B_0)) \le \exp(-(n C^i \mu(B_0)/3)) = \delta^{C^i/3} \le \delta^{iC/3}.$$

Summing over all $i$, we get

$$\Pr(\exists\, i \ge 1 : |X \cap B_i| \ge 2 n C^i \mu(B_0)) \le 2 \delta^{C/3} \le 2\delta.$$

We will henceforth assume that $x_{(1)}$ lies in $B_0$ and that each $B_i$ has at most $2 n \mu(B_0) C^i = 2 C^i \ln(1/\delta)$ points.
Pick any $2 \le m \le n$, and recall the expression for $\Phi$:

$$\Phi_m(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=2}^{m} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}.$$
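Once $x_{(1)}$ is fixed, moving other points closer to $q$ can only increase $\Phi$. Therefore, the maximizing
configuration has $2 n \mu(B_0) C$ points in $B_1$, followed by $2 n \mu(B_0) C^2$ points in $B_2$, and then $2 n \mu(B_0) C^3$ points in $B_3$,
and so on. Each point in $B_j \setminus B_{j-1}$ contributes at most $1/2^{j-1}$ to the $\Phi$ summation.

Under the worst-case configuration, points $x_{(1)}, \ldots, x_{(m)}$ lie within $B_\ell$, for $\ell$ such that

$$2 n \mu(B_0) C^{\ell-1} < m \le 2 n \mu(B_0) C^{\ell}. \qquad (*)$$

We then have

$$\begin{aligned}
\Phi_m &\le \frac{1}{m} \left( |X \cap B_1| + \sum_{j=2}^{\ell-1} \frac{|X \cap (B_j \setminus B_{j-1})|}{2^{j-1}} + \frac{m - |X \cap B_{\ell-1}|}{2^{\ell-1}} \right) \\
&= \frac{1}{m} \left( |X \cap B_1| + \sum_{j=2}^{\ell-1} \frac{|X \cap B_j| - |X \cap B_{j-1}|}{2^{j-1}} + \frac{m - |X \cap B_{\ell-1}|}{2^{\ell-1}} \right) \\
&= \frac{1}{m} \left( \frac{m}{2^{\ell-1}} + \sum_{j=1}^{\ell-1} \frac{|X \cap B_j|}{2^{j}} \right) \\
&\le \frac{1}{m} \left( \frac{m}{2^{\ell-1}} + 2 n \mu(B_0) \sum_{j=1}^{\ell-1} \left( \frac{C}{2} \right)^{j} \right) \\
&\le \frac{1}{m} \left( \frac{m}{2^{\ell-1}} + 4 n \mu(B_0) \left( \frac{C}{2} \right)^{\ell-1} \right) \\
&\le \frac{1}{m} \left( \frac{m}{2^{\ell-1}} + \frac{2m}{2^{\ell-1}} \right) = \frac{6}{2^{\ell}},
\end{aligned}$$

where the last inequality comes from (*). To lower-bound $2^\ell$, we again use (*) to get $C^\ell \ge m/(2 n \mu(B_0))$,
whereupon

$$2^{\ell} \ge \left( \frac{m}{2 n \mu(B_0)} \right)^{1/\log_2 C} = \left( \frac{m}{2 \ln(1/\delta)} \right)^{1/d_o}$$

and we're done. □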

This extends easily to the potential function for k nearest neighbors. 

Theorem 9 Under the same conditions as Theorem 8, for any $k \geq 1$, we have

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) \leq 6 \left(\frac{8}{m} \max\left(k, \ln\frac{1}{\delta}\right)\right)^{1/d_0}.$$



Proof: The only big change is in the definition of $r_0$; it is now the radius for which

$$\mu(B_0) = \frac{4}{n} \max\left(k, \ln\frac{1}{\delta}\right).$$

Thus, when $x_1, \ldots, x_n$ are drawn independently at random from $\mu$, the expected number of them that fall
in $B_0$ is at least $4k$, and by a multiplicative Chernoff bound is at least $k$ with probability $\geq 1 - \delta$.

The balls $B_1, B_2, \ldots$ are defined as before, and once again, we can conclude that with probability $\geq 1 - 2\delta$,
each $B_i$ contains at most $2nC^i\mu(B_0)$ of the data points.

Any point $x_{(i)} \notin B_0$ lies in some annulus $B_j \setminus B_{j-1}$, and its contribution to the summation in $\Phi_{k,m}$ is

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \frac{1}{2^{j-1}}.$$

The relationship (*) and the remainder of the argument are exactly as before. □

We can now give bounds on the failure probabilities of the three tree data structures. 

Theorem 10 There is an absolute constant $c_0$ for which the following holds. Suppose $\mu$ is a doubling
measure on $\mathbb{R}^D$ of intrinsic dimension $d_0 \geq 2$. Pick any query $q \in \mathcal{X}$ and draw $x_1, \ldots, x_n$ independently from
$\mu$. Then with probability at least $1 - 3\delta$ over the choice of data:

(a) For either variant of the spill tree, if $k \leq \alpha n_0/2$,

$$\Pr(\text{spill tree fails to return } k \text{ nearest neighbors}) \leq \frac{c_0 d_0 k}{\alpha} \left(\frac{8\max(k, \ln 1/\delta)}{n_0}\right)^{1/d_0}.$$

(b) For the RP tree with $n_0 \geq c_0 (3k)^{d_0} \max(k, \ln 1/\delta)$,

$$\Pr(\text{RP tree fails to return } k \text{ nearest neighbors}) \leq c_0 k (d_0 + \ln n_0) \left(\frac{8\max(k, \ln 1/\delta)}{n_0}\right)^{1/d_0}.$$

These probabilities are over the randomness in tree construction.



Proof: These bounds follow immediately from Theorems 6, 7 and 9, using Lemma 15 from the appendix
to bound the summation. □

In order to make the failure probability an arbitrarily small constant, it is sufficient to take $n_0 = O(d_0 k)^{d_0} \max(k, \ln 1/\delta)$
for spill trees and $n_0 = O(d_0 k \ln(d_0 k))^{d_0} \max(k, \ln 1/\delta)$ for RP trees.

4.2 A document model 

In a bag-of-words model, a document is represented as a binary vector in $\{0, 1\}^N$, where $N$ is the size of the
vocabulary and the $i$th coordinate is 1 if the document happens to contain the corresponding word. This is
a sparse representation in which the number of nonzero positions is typically much smaller than $N$.

Pick any query document $q \in \{0, 1\}^N$, and suppose that $x_1, \ldots, x_n$ are generated i.i.d. from a topic model
$\mu$. We will consider a simple such model with $t$ topics, each of which follows a product distribution. The
distribution $\mu$ is parametrized by the mixing weights over topics, $w_1, \ldots, w_t$, which sum to one, and the word
probabilities $(p_1^{(j)}, \ldots, p_N^{(j)})$ for each topic $1 \leq j \leq t$. Here is the generative process for a document $x$:

• Pick a topic $1 \leq j \leq t$, where the probability of picking $j$ is $w_j$.

• Set the coordinates of $x \in \{0, 1\}^N$ independently; the $i$th coordinate is 1 with probability $p_i^{(j)}$.
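The two-step generative process above can be sketched in a few lines of Python. The function names and the toy parameter values (two topics, a four-word vocabulary) are assumptions chosen for illustration only.

```python
import random

def sample_document(weights, word_probs, rng):
    """Draw one document x in {0,1}^N from the topic mixture.

    weights[j] is the mixing weight w_j; word_probs[j][i] is p_i^(j)."""
    # Step 1: pick topic j with probability w_j.
    j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Step 2: set each coordinate independently; x_i = 1 with probability p_i^(j).
    return [1 if rng.random() < p else 0 for p in word_probs[j]]

def hamming_distance(q, x):
    """Number of coordinates in which the binary vectors q and x differ."""
    return sum(qi != xi for qi, xi in zip(q, x))

rng = random.Random(0)
weights = [0.7, 0.3]                    # w_1, w_2 (sum to one)
word_probs = [[0.4, 0.3, 0.05, 0.05],   # topic 1: p_i^(1), all below 1/2
              [0.05, 0.05, 0.3, 0.4]]   # topic 2: p_i^(2)
q = [1, 1, 0, 0]
docs = [sample_document(weights, word_probs, rng) for _ in range(5)]
print([hamming_distance(q, x) for x in docs])
```

The Hamming distances printed at the end are the quantity whose distribution the rest of this section studies.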

The overall distribution is thus a mixture $\mu = w_1\mu_1 + \cdots + w_t\mu_t$ whose $j$th component is a Bernoulli product
distribution $\mu_j = B(p_1^{(j)}) \times \cdots \times B(p_N^{(j)})$. Here $B(p)$ is a shorthand for the distribution on $\{0, 1\}$ with
expected value $p$. It will simplify things to assume that $0 < p_i^{(j)} < 1/2$; this is not a huge assumption if, say,
stopwords have been removed.

For the purposes of bounding $\Phi$, we are interested in the distribution of $d_H(q, X)$, where $X$ is chosen from
$\mu$ and $d_H$ denotes Hamming distance. This is a sum of small independent quantities, and it is customary
to approximate such sums by a Poisson distribution. In the current context, however, this approximation is 
rather poor, and we instead use counting arguments to directly bound how rapidly the distribution grows. 
The results stand in stark contrast to those we obtained for doubling measures, and reveal this to be a 
substantially more difficult setting for nearest neighbor search. For a doubling measure, the probability 
mass of a ball $B(q, r)$ doubles whenever $r$ is multiplied by a constant. In our present setting, it doubles
whenever $r$ is increased by an additive constant. Specifically, it turns out (Lemma 12) that for $\ell \leq L/8$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq 4.$$

Here $L = \min(L_1, \ldots, L_t)$, where $L_j$ is the expected number of words in a document drawn from $\mu_j$, that
is, $L_j = p_1^{(j)} + \cdots + p_N^{(j)}$.

We start with the case of a single topic. 



4.2.1 Growth rate for one topic 

Let $q \in \{0, 1\}^N$ be any fixed document and let $X$ be drawn from a Bernoulli product distribution
$B(p_1) \times \cdots \times B(p_N)$. Then the Hamming distance $d_H(q, X)$ is distributed as a sum of Bernoullis,

$$d_H(q, X) \sim B(a_1) + \cdots + B(a_N),$$

where

$$a_i = \begin{cases} p_i & \text{if } q_i = 0 \\ 1 - p_i & \text{if } q_i = 1 \end{cases}$$

To understand this distribution, we start with a general result about sums of Bernoulli random variables.
Notice that the result is exactly correct in the situation where all $p_i = 1/2$.

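Lemma 11 Suppose $Z_1, \ldots, Z_N$ are independent, where $Z_i \in \{0, 1\}$ is a Bernoulli random variable with
mean $0 < a_i < 1$, and $a_1 \geq a_2 \geq \cdots \geq a_N$. Let $Z = Z_1 + \cdots + Z_N$. Then for any $\ell \geq 0$,

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} \geq \frac{1}{\ell + 1} \sum_{i = \ell + 1}^{N} \frac{a_i}{1 - a_i}.$$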



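Proof: Define $r_i = a_i/(1 - a_i) \in (0, \infty)$; then $r_1 \geq r_2 \geq \cdots \geq r_N$. Now, for any $\ell \geq 0$,

$$\begin{aligned}
\Pr(Z = \ell) &= \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} a_{i_1} a_{i_2} \cdots a_{i_\ell} \prod_{j \notin \{i_1, \ldots, i_\ell\}} (1 - a_j) \\
&= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} \frac{a_{i_1}}{1 - a_{i_1}} \frac{a_{i_2}}{1 - a_{i_2}} \cdots \frac{a_{i_\ell}}{1 - a_{i_\ell}} \\
&= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell}
\end{aligned}$$

where the summations are over subsets $\{i_1, \ldots, i_\ell\}$ of $\ell$ distinct elements of $[N]$. In the final line, the product
of the $(1 - a_i)$ does not depend upon $\ell$ and can be ignored. Let's focus on the summation; call it $S_\ell$. We
would like to compare it to $S_{\ell+1}$.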

$S_{\ell+1}$ is the sum of $\binom{N}{\ell+1}$ distinct terms, each the product of $\ell + 1$ $r_i$'s. These terms also appear in the
quantity $S_\ell(r_1 + \cdots + r_N)$; in fact, each term of $S_{\ell+1}$ appears multiple times, $\ell + 1$ times to be precise. The
remaining terms in $S_\ell(r_1 + \cdots + r_N)$ each contain $\ell - 1$ unique elements and one duplicated element. By
accounting in this way, we get

$$\begin{aligned}
S_\ell(r_1 + \cdots + r_N) &= (\ell + 1) S_{\ell+1} + \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell} (r_{i_1} + \cdots + r_{i_\ell}) \\
&\leq (\ell + 1) S_{\ell+1} + S_\ell (r_1 + \cdots + r_\ell)
\end{aligned}$$

since the $r_i$'s are arranged in decreasing order. Hence

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} = \frac{S_{\ell+1}}{S_\ell} \geq \frac{1}{\ell + 1} (r_{\ell+1} + \cdots + r_N),$$

as claimed. □
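Lemma 11 is easy to check numerically on small instances. The sketch below computes the exact distribution of a sum of independent Bernoullis by dynamic programming and prints each ratio $\Pr(Z = \ell+1)/\Pr(Z = \ell)$ next to the lemma's lower bound. The function names and the example means are hypothetical choices made for illustration.

```python
def bernoulli_sum_pmf(a):
    """Return pmf with pmf[l] = Pr(Z = l), where Z is a sum of
    independent Bernoulli variables with means a[0], ..., a[N-1]."""
    pmf = [1.0]  # distribution of the empty sum
    for ai in a:
        nxt = [0.0] * (len(pmf) + 1)
        for l, p in enumerate(pmf):
            nxt[l] += p * (1.0 - ai)   # this variable is 0
            nxt[l + 1] += p * ai       # this variable is 1
        pmf = nxt
    return pmf

def lemma11_lower_bound(a, l):
    """(1/(l+1)) * sum of the N - l smallest ratios r_i = a_i / (1 - a_i)."""
    r = sorted((ai / (1.0 - ai) for ai in a), reverse=True)
    return sum(r[l:]) / (l + 1)

a = [0.9, 0.6, 0.4, 0.25, 0.1]
pmf = bernoulli_sum_pmf(a)
for l in range(len(a)):
    print(l, pmf[l + 1] / pmf[l], lemma11_lower_bound(a, l))
```

When every mean equals 1/2, all $r_i = 1$ and both columns reduce to $(N - \ell)/(\ell + 1)$, which is the equality case noted before the lemma.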

We now apply this result directly to the sum of Bernoulli variables $Z = d_H(q, X)$.

Lemma 12 Suppose that $p_1, \ldots, p_N \in (0, 1/2)$. Pick any query $q \in \{0, 1\}^N$, and draw $X$ from distribution
$\mu = B(p_1) \times \cdots \times B(p_N)$. Then for any $\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{L - \ell/2}{\ell + 1},$$

where $L = \sum_i p_i$ is the expected number of words in $X$.

Proof: Suppose $q$ contains $k_0$ nonzero entries. Without loss of generality, these are $q_1, \ldots, q_{k_0}$.

As we have seen, $d_H(q, X)$ is distributed as the Bernoulli sum $B(1 - p_1) + \cdots + B(1 - p_{k_0}) + B(p_{k_0+1}) +
\cdots + B(p_N)$. Define

$$r_i = \begin{cases} (1 - p_i)/p_i & \text{if } i \leq k_0 \\ p_i/(1 - p_i) & \text{if } i > k_0 \end{cases}$$

Notice that $r_i > 1$ for $i \leq k_0$, and $r_i < 1$ for $i > k_0$; and that $r_i \geq p_i$ always.
By Lemma 11, we have that for any $\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{1}{\ell + 1} \sum_{i > \ell} r_{(i)},$$

where $r_{(1)} \geq \cdots \geq r_{(N)}$ denotes the reordering of $r_1, \ldots, r_N$ into descending order. Since each $r_i \geq p_i$, and
each $p_i$ is at most $1/2$,

$$\sum_{i > \ell} r_{(i)} \geq (\text{sum of } N - \ell \text{ smallest } p_i\text{'s}) \geq \Big(\sum_i p_i\Big) - \ell/2 = L - \ell/2.$$

□

4.2.2 Growth rate for multiple topics 

Now let's return to the original model, in which $X$ is chosen from a mixture of $t$ topics $\mu = w_1\mu_1 + \cdots + w_t\mu_t$,
with $\mu_j = B(p_1^{(j)}) \times \cdots \times B(p_N^{(j)})$. Then for any $\ell$,

$$\Pr(d_H(q, X) = \ell \mid X \sim \mu) = \sum_{j=1}^{t} w_j \Pr(d_H(q, X) = \ell \mid X \sim \mu_j).$$

Combining this relation with Lemma 12, we immediately get the following.

Corollary 13 Suppose that all $p_i^{(j)} \in (0, 1/2)$. Let $L_j = \sum_i p_i^{(j)}$ denote the expected number of words in a
document from topic $j$, and let $L = \min(L_1, \ldots, L_t)$. Pick any query $q \in \{0, 1\}^N$, and draw $X \sim \mu$. For any
$\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{L - \ell/2}{\ell + 1}.$$
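The growth rate in Corollary 13 can be examined exactly for a small mixture. The following sketch builds the distribution of $d_H(q, X)$ for each topic, mixes them with the weights $w_j$, and prints the successive ratios beside $(L - \ell/2)/(\ell + 1)$. All names and parameter values here are illustrative assumptions, with an eight-word vocabulary and two topics.

```python
def bernoulli_sum_pmf(a):
    """pmf[l] = Pr(sum of independent Bernoulli(a_i) equals l)."""
    pmf = [1.0]
    for ai in a:
        nxt = [0.0] * (len(pmf) + 1)
        for l, p in enumerate(pmf):
            nxt[l] += p * (1.0 - ai)
            nxt[l + 1] += p * ai
        pmf = nxt
    return pmf

def hamming_pmf(q, p):
    """Distribution of d_H(q, X) for X ~ B(p_1) x ... x B(p_N):
    coordinate i disagrees with q_i with probability p_i if q_i = 0, else 1 - p_i."""
    return bernoulli_sum_pmf([pi if qi == 0 else 1.0 - pi for qi, pi in zip(q, p)])

def mixture_hamming_pmf(q, weights, word_probs):
    """Distribution of d_H(q, X) when X is drawn from the topic mixture."""
    pmfs = [hamming_pmf(q, p) for p in word_probs]
    return [sum(w * pmf[l] for w, pmf in zip(weights, pmfs))
            for l in range(len(q) + 1)]

weights = [0.6, 0.4]
word_probs = [[0.45, 0.4, 0.35, 0.3, 0.1, 0.1, 0.05, 0.05],
              [0.05, 0.1, 0.1, 0.2, 0.3, 0.4, 0.45, 0.45]]
q = [1, 0, 1, 0, 0, 0, 1, 0]
L = min(sum(p) for p in word_probs)   # L = min_j L_j
pmf = mixture_hamming_pmf(q, weights, word_probs)
for l in range(len(q)):
    print(l, pmf[l + 1] / pmf[l], (L - l / 2) / (l + 1))
```

For $\ell \geq 2L$ the lower bound is non-positive and therefore vacuous, so the comparison is only interesting at small Hamming distances.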



4.2.3 Bounding $ 

Fix a particular query $q \in \{0, 1\}^N$, and draw $x_1, \ldots, x_n$ from distribution $\mu$. Let the random variable $S_\ell$
denote the points at Hamming distance exactly $\ell$ from $q$, so that $\mathbb{E}|S_\ell| = n \Pr_{X \sim \mu}(d_H(q, X) = \ell)$.

Lemma 14 There is an absolute constant $c_0$ for which the following holds. Pick any $0 < \delta < 1$ and any
$k \geq 1$, and let $v$ denote the smallest integer for which $\Pr_{X \sim \mu}(d_H(q, X) \leq v) \geq (8/n) \max(k, \ln 1/\delta)$. Then
with probability at least $1 - 3\delta$,

(a) $|S_0| + \cdots + |S_v| \geq k$.

(b) If $v \leq c_0 L$ then $|S_0| + \cdots + |S_{v-1}| \leq |S_v|$.

(c) For all $v \leq \ell \leq c_0 L$, we have $|S_{\ell+1}|/|S_\ell| \geq 2$.

If (a, b, c) hold, then for any $m \leq n$,

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) \leq 4 \sqrt{\frac{v}{c_0 L - \log_2(n/m)}}.$$



Proof: Parts (a, b, c) are shown by applying multiplicative Chernoff bounds to the result of Corollary 13.
The details are very similar to those of Theorem 9 and hence we omit them and turn to bounding $\Phi$.
Suppose that for some $i > k$, point $x_{(i)}$ is at Hamming distance $\ell$ from $q$, that is, $x_{(i)} \in S_\ell$. Then

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \sqrt{\frac{v}{\ell}}$$

since Euclidean distance is the square root of Hamming distance. In bounding $\Phi_{k,m}$, we need to gauge the
range of Hamming distances spanned by $x_{(k+1)}, \ldots, x_{(m)}$.

The geometric growth rate of part (c) implies that most points lie at Hamming distance $c_0 L$ or greater
from $q$. It also means that $d_H(q, x_{(m)}) \geq c_0 L - \log_2(n/m)$. Thus,

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) \leq \frac{1}{m} \sum_{\ell} |S_\ell| \sqrt{\frac{v}{\ell}} \leq 4 \sqrt{\frac{v}{c_0 L - \log_2(n/m)}}$$

where the last step follows by lower-bounding $|S_\ell|$ by an increasing geometric series. □
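The threshold distance $v$ in Lemma 14 is defined purely in terms of the distribution of $d_H(q, X)$, so it can be computed directly once that distribution is known. The sketch below does this for a probability mass function supplied as a list; the function name and the toy distribution are assumptions made for illustration.

```python
import math

def smallest_v(pmf, n, k, delta):
    """Smallest integer v with Pr(d_H(q, X) <= v) >= (8/n) * max(k, ln(1/delta)).

    pmf[l] is assumed to hold Pr(d_H(q, X) = l) for l = 0, 1, ..., N."""
    threshold = (8.0 / n) * max(k, math.log(1.0 / delta))
    cdf = 0.0
    for v, p in enumerate(pmf):
        cdf += p
        if cdf >= threshold:
            return v
    raise ValueError("threshold exceeds total mass; n is too small for this k and delta")

# Toy distribution whose mass grows rapidly with the distance,
# mimicking the geometric growth established above.
toy_pmf = [0.001, 0.009, 0.09, 0.9]
for n in (100, 1000):
    print(n, smallest_v(toy_pmf, n=n, k=1, delta=0.5))
```

Because the threshold scales as $1/n$, a larger data set yields a smaller $v$, in line with the remark that $v$ is expected to be small when $n$ is large.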

The implication of this lemma is that for any of the three tree data structures, the failure probability at
a single level is roughly $\sqrt{v/L}$. This means that the tree can only be grown to depth $O(\sqrt{L/v})$, and thus
the query time is dominated by $n_0 = n \cdot 2^{-O(\sqrt{L/v})}$.

When $n$ is large, we expect $v$ to be small, and thus the query time improves over exhaustive search by a
factor of roughly $2^{-\sqrt{L}}$.

A Technical lemma

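Lemma 15 Suppose that for some constants $A, B > 0$ and $d_0 \geq 1$,

$$F(m) \leq A \left(\frac{B}{m}\right)^{1/d_0}$$

for all $m \geq n_0$. Pick any $0 < \beta < 1$ and define $\ell = \log_{1/\beta}(n/n_0)$. Then:

$$\sum_{i=0}^{\ell} F(\beta^i n) \leq \frac{A d_0}{1 - \beta} \left(\frac{B}{n_0}\right)^{1/d_0}$$

and, if $n_0 \geq B(A/2)^{d_0}$,

$$\sum_{i=0}^{\ell} F(\beta^i n) \ln \frac{2e}{F(\beta^i n)} \leq \frac{A d_0}{1 - \beta} \left(\frac{B}{n_0}\right)^{1/d_0} \left(\frac{1}{1 - \beta} \ln \frac{1}{\beta} + \ln \frac{2e}{A} + \frac{1}{d_0} \ln \frac{n_0}{B}\right).$$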


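Proof: Writing the first series in reverse,

$$\begin{aligned}
\sum_{i=0}^{\ell} F(\beta^i n) &= \sum_{i=0}^{\ell} F\left(\frac{n_0}{\beta^i}\right) \leq \sum_{i=0}^{\ell} A \left(\frac{B \beta^i}{n_0}\right)^{1/d_0} = A \left(\frac{B}{n_0}\right)^{1/d_0} \sum_{i=0}^{\ell} \beta^{i/d_0} \\
&\leq \frac{A}{1 - \beta^{1/d_0}} \left(\frac{B}{n_0}\right)^{1/d_0} \leq \frac{A d_0}{1 - \beta} \left(\frac{B}{n_0}\right)^{1/d_0}.
\end{aligned}$$

The last inequality is obtained by using

$$(1 - x)^p \geq 1 - px \quad \text{for } 0 < x < 1,\ p \geq 1$$

to get $(1 - (1 - \beta)/d_0)^{d_0} \geq \beta$ and thus $1 - \beta^{1/d_0} \geq (1 - \beta)/d_0$.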
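The first bound can be checked numerically in the extreme case where $F(m)$ equals its upper envelope $A(B/m)^{1/d_0}$. The sketch below assumes that $\ell$ is an integer, sets $n = n_0/\beta^{\ell}$, and returns the series together with the claimed bound. The function name and parameter values are hypothetical.

```python
def first_bound_check(A, B, d0, beta, n0, levels):
    """Return (series, bound) for the first inequality of Lemma 15,
    taking F(m) = A * (B/m)^(1/d0) exactly and n = n0 / beta**levels,
    so that levels plays the role of l = log_{1/beta}(n / n0)."""
    n = n0 / beta ** levels

    def F(m):
        return A * (B / m) ** (1.0 / d0)

    series = sum(F(beta ** i * n) for i in range(levels + 1))
    bound = A * d0 / (1.0 - beta) * (B / n0) ** (1.0 / d0)
    return series, bound

for d0 in (1, 2, 5, 20):
    for beta in (0.5, 0.75, 0.9):
        series, bound = first_bound_check(A=3.0, B=8.0, d0=d0, beta=beta,
                                          n0=1000.0, levels=30)
        print(d0, beta, series, bound)
```

The gap between the two columns comes from truncating the geometric series at $\ell$ terms and from the inequality $1 - \beta^{1/d_0} \geq (1 - \beta)/d_0$, which is tight only when $d_0 = 1$.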

Now we move on to the second bound. The lower bound on $n_0$ implies that $A(B/m)^{1/d_0} \leq 2$ for all
$m \geq n_0$. Since $x \ln(2e/x)$ is increasing when $x \leq 2$, we have

$$F(m) \ln \frac{2e}{F(m)} \leq A \left(\frac{B}{m}\right)^{1/d_0} \ln \frac{2e}{A (B/m)^{1/d_0}} \quad \text{for all } m \geq n_0.$$

The lemma now follows from algebraic manipulations that invoke the first bound as well as the inequality
which in turn follows from

□